Add error to point user to slurm resume log by hgreebe · Pull Request #676 · aws/aws-parallelcluster-node

hgreebe · 2025-10-29T13:14:30Z

Description of changes

Add error to point user to slurm resume log
Ignore Flake8 rule B042: it is a minor rule and is also affected by false positives (see "B042 false positive", PyCQA/flake8-bugbear#525).

Tests

Ran test_slurm_accounting with a change that triggers the InvalidParameter error.

In slurmctld:
[2025-10-30T12:16:05.106] update_node: node compute-dy-cit-10 reason set to: (Code:InsufficientInstanceCapacity)Failure when resuming nodes - Check the slurm_resume log for EC2 error codes

In clustermgtd:
2025-10-30 12:26:48,861 - [slurm_plugin.clustermgtd:_reset_timeout_expired_compute_resources] - INFO - The following compute resources are in down state due to insufficient capacity: {'compute': {'cit': ComputeResourceFailureEvent(timestamp=datetime.datetime(2025, 10, 30, 12, 16, 42, 904354, tzinfo=datetime.timezone.utc), error_code='InsufficientInstanceCapacity')}}, compute resources will be reset after insufficient capacity timeout (600.0 seconds) expired. Check the slurm_resume log for EC2 error codes.

In slurm_resume:
2025-10-29 13:39:59,768 - 8667 - [slurm_plugin.fleet_manager:_launch_instances] - ERROR - JobID 2 - Error in CreateFleet request (aa9a6ad3-7ac3-4745-a4d1-b8c178907b8b): InvalidParameter - Security group sg-0f2789bcd3e49cdf3 and subnet subnet-0c766771e11dab28c belong to different networks.
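For readers who script against these logs, the slurmctld reason shown above appears to follow a "(Code:<error code>)<message>" layout. The snippet below is a minimal sketch that splits such a string into its code and message parts. The pattern and the `split_reason` helper are assumptions inferred from the single sample line, not part of the plugin.

```python
import re

# Assumed layout, inferred from the slurmctld sample above: "(Code:<error code>)<message>".
REASON_PATTERN = re.compile(r"\(Code:(?P<code>[^)]+)\)(?P<message>.*)")


def split_reason(reason: str):
    """Return (error_code, message), or (None, reason) when no code prefix is present."""
    match = REASON_PATTERN.match(reason)
    if match is None:
        return None, reason
    return match.group("code"), match.group("message").strip()


reason = (
    "(Code:InsufficientInstanceCapacity)Failure when resuming nodes - "
    "Check the slurm_resume log for EC2 error codes"
)
code, message = split_reason(reason)
```

With this layout the error code stays machine-readable while the hint lives in the free-text message.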

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

gmarciani · 2025-10-29T16:55:29Z

-            self._update_failed_nodes(set(nodes_resume_list), "InsufficientInstanceCapacity", override=False)
+            self._update_failed_nodes(
+                set(nodes_resume_list),
+                "InsufficientInstanceCapacity(Check slurm_resume log for ec2 error codes)",


[minor] typo ec2 -> EC2

gmarciani · 2025-10-29T16:56:22Z

-            self._update_failed_nodes(set(nodes_resume_list), "InsufficientInstanceCapacity", override=False)
+            self._update_failed_nodes(
+                set(nodes_resume_list),
+                "InsufficientInstanceCapacity(Check slurm_resume log for ec2 error codes)",


InsufficientInstanceCapacity is a common pcluster error code used in different places. Can we define a constant for it?

Same for the sentence Check....codes. Since we repeat it in many places, we can define a constant for it as well.
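A minimal sketch of what that suggestion could look like. The constant names and the `build_failure_reason` helper are hypothetical; the names actually used in the repository may differ.

```python
# Hypothetical constant names; the ones introduced by the PR may differ.
ICE_ERROR_CODE = "InsufficientInstanceCapacity"
CHECK_RESUME_LOG_HINT = "Check the slurm_resume log for EC2 error codes"


def build_failure_reason(base_message: str, hint: str = CHECK_RESUME_LOG_HINT) -> str:
    """Append the shared hint to a failure message without touching the error code."""
    return f"{base_message} - {hint}"


reason = build_failure_reason("Failure when resuming nodes")
```

Defining both strings once keeps the wording (including the EC2 capitalization) consistent across every call site and test.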

gmarciani · 2025-10-29T17:06:17Z

-            self._update_failed_nodes(set(nodes_resume_list), "InsufficientInstanceCapacity", override=False)
+            self._update_failed_nodes(
+                set(nodes_resume_list),
+                "InsufficientInstanceCapacity(Check slurm_resume log for ec2 error codes)",


[BLOCKING] I agree with making the log line more helpful by redirecting the user to the right log. However, doing it this way actually changes the error code from InsufficientInstanceCapacity to InsufficientInstanceCapacity(Check...codes), which can ultimately affect the way we monitor ICE errors on the cluster dashboard.
See

aws-parallelcluster-node/src/slurm_plugin/cluster_event_publisher.py

Line 27 in 1b4ba77

**{failure: "ice-failures" for failure in [*SlurmNode.EC2_ICE_ERROR_CODES, "LimitedInstanceCapacity"]},
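To illustrate the concern, here is a minimal sketch of an exact-match lookup shaped like the mapping quoted above. `EC2_ICE_ERROR_CODES` is a one-element stand-in for `SlurmNode.EC2_ICE_ERROR_CODES`, and the `categorize` helper with its "other-failures" fallback is hypothetical.

```python
# Stand-in for SlurmNode.EC2_ICE_ERROR_CODES; the real set is not reproduced here.
EC2_ICE_ERROR_CODES = {"InsufficientInstanceCapacity"}

# Same shape as the mapping quoted above: exact error code -> event category.
ERROR_CODE_TO_CATEGORY = {
    **{failure: "ice-failures" for failure in [*EC2_ICE_ERROR_CODES, "LimitedInstanceCapacity"]},
}


def categorize(error_code: str) -> str:
    """Look up the category by exact match; the fallback name is hypothetical."""
    return ERROR_CODE_TO_CATEGORY.get(error_code, "other-failures")


plain = categorize("InsufficientInstanceCapacity")
decorated = categorize("InsufficientInstanceCapacity(Check slurm_resume log for ec2 error codes)")
```

Because the lookup is keyed on the exact string, the decorated code no longer lands in "ice-failures", which is why the hint belongs in the message rather than in the error code.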


gmarciani · 2025-10-30T16:34:43Z

        log.info(
            "The following compute resources are in down state due to insufficient capacity: %s, "
-            "compute resources will be reset after insufficient capacity timeout (%s seconds) expired",
+            "compute resources will be reset after insufficient capacity timeout (%s seconds) expired. "


[Test] Can we reflect this change in the corresponding unit test, as you did for the resume script?
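One way such a check could be sketched with only the standard library is shown below. The logger name, `report_ice_resources`, and `capture_messages` are hypothetical stand-ins; the repository's own tests may instead rely on pytest fixtures such as `caplog`.

```python
import logging

log = logging.getLogger("clustermgtd_sketch")  # hypothetical logger name


def report_ice_resources(resources: dict, timeout_seconds: float) -> None:
    """Simplified stand-in for the log call shown in the diff above."""
    log.info(
        "The following compute resources are in down state due to insufficient capacity: %s, "
        "compute resources will be reset after insufficient capacity timeout (%s seconds) expired. "
        "Check the slurm_resume log for EC2 error codes.",
        resources,
        timeout_seconds,
    )


class _ListHandler(logging.Handler):
    """Collects formatted log messages in memory so they can be asserted on."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def capture_messages(func, *args):
    """Run func(*args) and return the messages it logged at INFO level or above."""
    handler = _ListHandler()
    previous_level = log.level
    log.addHandler(handler)
    log.setLevel(logging.INFO)
    try:
        func(*args)
    finally:
        log.removeHandler(handler)
        log.setLevel(previous_level)
    return handler.messages


messages = capture_messages(report_ice_resources, {"compute": {"cit": "..."}}, 600.0)
```

Asserting on the captured message pins the new trailing sentence, so a later rewording of the hint fails the test instead of slipping through.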

hgreebe force-pushed the develop-test branch 2 times, most recently from 14ea243 to 881d029 on October 29, 2025 16:43

hgreebe marked this pull request as ready for review October 29, 2025 16:52

hgreebe requested review from a team as code owners October 29, 2025 16:52

gmarciani reviewed Oct 29, 2025


hgreebe added 3 commits October 30, 2025 08:53

Add error to point user to slurm resume log

5fd0c47

(cherry picked from commit 84ec039)

Fix unit tests

e71824a

(cherry picked from commit c84aeb5)

Fix code linter

3cc569f

(cherry picked from commit bdc8706)

hgreebe force-pushed the develop-test branch from 1345130 to 0f10a8e on October 30, 2025 12:53

Update CHANGELOG

dff72fa

hgreebe force-pushed the develop-test branch from 0f10a8e to dff72fa on October 30, 2025 12:56

Fix linter errors

73d764b

gmarciani reviewed Oct 30, 2025


Add unit test for logs in clustermgtd

f2f2556

gmarciani approved these changes Oct 30, 2025


hgreebe merged commit 9457ba4 into aws:develop Oct 30, 2025
12 checks passed

hgreebe merged 6 commits into aws:develop from hgreebe:develop-test
